Next-generation sequencing facilitates quantitative analysis of wild-type and Nrl(-/-) retinal transcriptomes.

PURPOSE
Next-generation sequencing (NGS) has revolutionized systems-based analysis of cellular pathways. The goals of this study are to compare NGS-derived retinal transcriptome profiling (RNA-seq) to microarray and quantitative reverse transcription polymerase chain reaction (qRT-PCR) methods and to evaluate protocols for optimal high-throughput data analysis.


METHODS
Retinal mRNA profiles of 21-day-old wild-type (WT) and neural retina leucine zipper knockout (Nrl(-/-)) mice were generated by deep sequencing, in triplicate, using Illumina GAIIx. The sequence reads that passed quality filters were analyzed at the transcript isoform level with two methods: Burrows-Wheeler Aligner (BWA) followed by ANOVA (ANOVA) and TopHat followed by Cufflinks. qRT-PCR validation was performed using TaqMan and SYBR Green assays.


RESULTS
Using an optimized data analysis workflow, we mapped about 30 million sequence reads per sample to the mouse genome (build mm9) and identified 16,014 transcripts in the retinas of WT and Nrl(-/-) mice with BWA workflow and 34,115 transcripts with TopHat workflow. RNA-seq data confirmed stable expression of 25 known housekeeping genes, and 12 of these were validated with qRT-PCR. RNA-seq data had a linear relationship with qRT-PCR for more than four orders of magnitude and a goodness of fit (R(2)) of 0.8798. Approximately 10% of the transcripts showed differential expression between the WT and Nrl(-/-) retina, with a fold change ≥1.5 and p value <0.05. Altered expression of 25 genes was confirmed with qRT-PCR, demonstrating the high degree of sensitivity of the RNA-seq method. Hierarchical clustering of differentially expressed genes uncovered several as yet uncharacterized genes that may contribute to retinal function. Data analysis with BWA and TopHat workflows revealed a significant overlap yet provided complementary insights in transcriptome profiling.


CONCLUSIONS
Our study represents the first detailed analysis of retinal transcriptomes, with biologic replicates, generated by RNA-seq technology. The optimized data analysis workflows reported here should provide a framework for comparative investigations of expression profiles. Our results show that NGS offers a comprehensive and more accurate quantitative and qualitative evaluation of mRNA content within a cell or tissue. We conclude that RNA-seq based transcriptome characterization would expedite genetic network analyses and permit the dissection of complex biologic functions.

Deep sequencing of RNA with NGS (called "RNA-seq") allows a comprehensive evaluation and quantification of all subtypes of RNA molecules expressed in a cell or tissue [16]. RNA-seq technology can detect transcripts expressed at low levels [17] and permit the identification of unannotated transcripts and new spliced isoforms [16,18]. The issues related to cross-hybridization and detection levels that limit the accuracy of gene expression estimates by microarray technology are not relevant to the data obtained with RNAseq [19]. Visualization of mapped sequence reads spanning the splice junctions can also reveal novel splice forms of annotated genes in the mouse retina, which was not possible with earlier hybridization-based technologies. With a steady reduction in the costs of NGS, RNA-seq is now emerging as a method of choice for comprehensive transcriptome profiling.
The vertebrate retina exhibits a highly organized laminar structure that captures, integrates, and transmits visual information to the brain for further processing. Photoreceptors constitute more than 70% of the retinal cells and convert light into electrical signals [20]. Rod photoreceptors mediate dim light vision and can detect a single photon of light, while cone photoreceptors are responsible for daylight vision, color perception, and visual acuity [21,22]. Impairment of photoreceptor function leads to retinal degeneration with a more common pattern of rod death preceding the death of cones [23][24][25]. The neural retina leucine zipper (Nrl) gene encodes a basic-motif leucine zipper transcription factor necessary for determining rod photoreceptor cell fate and functional maintenance [26]. The Nrl −/− mouse, generated by creating a loss of function mutation in Nrl, has a cone-only retina that serves as a useful model for studies of cone biology [26][27][28].
Several previous investigations have elucidated the gene expression landscape specific to whole retina or retinal cell types and during development or aging. Serial analysis of gene expression [29][30][31] and cDNA eye gene arrays [32][33][34][35][36] were initially used to determine signatures of retinal gene expression. Oligonucleotide microarrays have since allowed a more comprehensive approach to expression profiling [37][38][39][40][41]. Microarray analyses of flow-sorted photoreceptors and single cells from dissociated retinas [42][43][44] have begun to reveal new insights into regulatory networks. Application of NGS technology greatly expands the power of expression profiling by identifying all transcripts and spliced isoforms in the tissue or cell type of interest.
Here, we have used the power of NGS-based RNA-seq analysis to investigate in depth the transcriptome of wild-type (WT) and Nrl −/− retinas and identified a set of differentially expressed genes and spliced isoforms. We have also taken advantage of the knowledge about Nrl −/− mice to optimize workflows for data analysis and compared our results with those obtained with microarray methods and quantitative reverse transcription polymerase chain reaction (qRT-PCR) analysis. Our studies illustrate that RNA-seq offers a more complete, accurate, and relatively faster approach for comparative and comprehensive analysis of retinal transcriptomes and for discovering novel transcribed sequences. Our validated data analysis workflow should also be beneficial for similar studies by other investigators. Raw data and workflow are available on the N-NRL/NEI website.

METHODS
Animals and tissue collection: All investigations on mice were approved by the Animal Care and Use Committee of the National Eye Institute and followed the tenets of the Declaration of Helsinki. C57Bl/6J (referred to as wild type, WT) and Nrl −/− (on C57Bl/6J background [26]) mice were euthanized with CO2 inhalation. The retinas were excised rapidly, frozen on dry ice, and stored at −80 °C.
RNA isolation: Fresh frozen mouse retinas were lysed with a mortar and pestle in TRIzol Reagent, and total RNA was isolated according to the manufacturer's protocol (Invitrogen, Carlsbad, CA). RNA quality and quantity were assessed with the RNA 6000 Nano Kit (Agilent, Santa Clara, CA).
NGS library construction: Whole retinal RNA samples were independently processed from three wild-type and three Nrl −/− mice at P21. Total RNA (1 μg) was used with the TruSeq mRNA-seq Sample Preparation Kit (Illumina, San Diego, CA) to construct cDNA libraries. The quality of the libraries was verified using the DNA-1000 Kit (Agilent) and quantitation performed with qRT-PCR using ABI 7900HT (Life Technologies, Carlsbad, CA), as suggested in the Sequencing Library qRT-PCR Quantification Guide (Illumina). Gene Expression Master Mix (Life Technologies) was used for the qRT-PCR reactions, and a titration of phiX control libraries was employed as the quantification standard.
Illumina sequencing: Each cDNA library (10 pM) was independently loaded into one flow cell lane, and single-read cluster generation proceeded using the TruSeq SR Cluster Generation Kit v5 (Illumina). Sequencing-by-synthesis (SBS) of 70-nucleotide length was performed on a Genome Analyzer IIx running SCS2.8 software using SBS v4 reagents (Illumina). Base calling and chastity filtering were performed using RTA (real-time analysis with SCS2.8).
Burrows-Wheeler transform-based short read aligner analysis workflow: Burrows-Wheeler Transform Aligner (BWA) [45] was used to align RNA-seq reads against the mouse reference genome (build mm9), downloaded and indexed from the University of California Santa Cruz (UCSC) genome browser database [46]. The resulting sequence alignment/map files were imported into Partek Genomics Suite (Partek Inc., St. Louis, MO) to compute raw and fragments per kilobase of exon model per million mapped (FPKM) reads normalized expression values of the transcript isoforms defined in the UCSC refFlat file. A stringent filtering criterion of FPKM value 1.0 (equivalent to one transcript per cell [16]) in at least one out of six samples was used to obtain expressed transcripts. The FPKM values of the filtered transcripts were log-transformed using log2 (FPKM+offset) with an offset=1.0. ANOVA (ANOVA) was then performed on the log-transformed data of the two groups (WT and Nrl −/ − ) to generate fold change and p values for each transcript. Ychromosome transcripts were filtered out along with noncoding (nc) RNAs, mitochondrial DNA coded genes, pseudogenes, and predicted protein-coding genes. Differentially expressed mRNA isoforms were filtered for a fold change cutoff of 1.5 and p-value cutoff of 0.05. These criteria were implemented to enable a comparison with previous expression studies. Hierarchical clustering was performed using Cluster 3.0 software [47]. We used uncentered correlation as the distance metric. Heatmaps and dendrograms were generated using JavaTreeView software [48]. Aligned reads were visualized using the Integrated Genomics Viewer (IGV) [49].
TopHat/Cufflinks-based analysis workflow: Raw reads that passed the chastity filter threshold were mapped using TopHat [50] to identify known and novel splice junctions and to generate read alignments for each sample. Genomic annotations were obtained from Ensembl in gene transfer format (GTF). Splice junctions from the six samples were combined into a master junctions file that was used as an input file for the second iteration of TopHat mapping. The transcript isoform level and gene level counts were calculated and FPKM normalized using Cufflinks. An FPKM filtering cutoff of 1.0 in at least one of the six samples was used to determine expressed transcripts. Differential transcript expression was then computed using Cuffdiff. The resulting lists of differentially expressed isoforms were filtered and sorted into the following categories: protein coding mRNA transcripts and ncRNA transcripts. qRT-PCR analysis: Reverse transcription (RT) reactions were performed using oligo(dT)20 with SuperScript II reagents (Life Technologies) according to the manufacturer's protocol. cDNA synthesized from 2 μg of total RNA (1 μg for minus RT controls) was diluted to 100 μl (fivefold dilution), and from this 1 μl was used for each qRT-PCR reaction. The qRT-PCR reactions were performed in triplicate for TaqMan assays or in duplicate for the SYBR assays, using three biologic replicates per genotype, on a 7900HT Genetic Analyzer (Life Technologies). TaqMan assays were performed using TaqMan Gene Expression Master Mix and TaqMan Gene Expression Assays (Life Technologies) for genes listed in Table 1. The SYBR Green assays (Table 2) were performed using Power SYBR Green Master Mix (Life Technologies) and oligonucleotides at a final concentration of 200 nM. Oligonucleotides were designed using the Primer3 PCR Primer Design Tool [51] and synthesized by Integrated DNA Technologies (Coralville, IA). To eliminate complications due to contaminating genomic DNA in the RNA samples, qRT-PCR reactions were also performed with minus-RT control, using hypoxanthine guanine phosphoribosyl transferase (Hprt) primer pairs that can differentiate between mRNA and genomic DNA (data not shown). Differential expression analysis was performed using the ddCt method [52], with the geometric average of actin, beta (ActB) and Hprt as the endogenous controls [53].

Sequencing run summary:
Six libraries of P21 retinal cDNA (three each from WT and Nrl −/− ) were sequenced to obtain 35 to 49 million raw sequence reads per sample (Table 3). Of these, 75.8% to 82.7% reads passed the RTA chastity filter and were used for subsequent Burrows-Wheeler Aligner (BWA) and TopHat analysis workflows ( Figure 1). Due to TopHat workflow's power to map across splice junctions, the workflow consistently yielded 6 to 7 million more alignments per sample when compared to BWA.
BWA workflow: Based on the BWA analysis workflow, 16,014 transcripts were detected with a normalized FPKM value greater than 1.0 in any of the six samples. Transcripts were filtered based on whether they were mRNAs or ncRNAs. Of the 15,142 mRNA transcripts, only 1,422 met our criteria of differential expression of having a fold change greater than 1.5 and a p-value less than 0.05 (Table 4). Of the 1,422 differentially expressed mRNA transcripts (DETs) representing 1,218 unique genes, 551 were downregulated in Nrl −/− (including rod-specific genes) retinas, and 871 were  Each of the 3 week old WT and Nrl −/− retina sample was sequenced on a separate lane of the Illumina GAIIx flow cell to obtain 35 to 49 million raw reads. Over 75% of the raw reads passed Illumina's read chastity threshold to yield 29 to 37 million usable PF reads. TopHat mapping always gave significantly more alignments than BWA because of its ability to map across the splice junctions. A relatively smaller numbers of reads and alignments for WT samples 1 and 2 are not a matter of concern as FPKM normalization was used to assess the transcript isoform expression. WT=wild type. PF=pass filter Figure 1. Flowchart of RNA-seq data analysis methodology using Burrows-Wheeler Aligner (BWA) and TopHat. Schematic representation of two RNA-seq data analysis workflows and resulting views of the data generated. A: BWA workflow: Gapped alignments are performed using the BWA algorithm against the mouse reference genome build mm9, and estimation of the expression of genes at the transcript isoform level is performed by importing aligned reads into the Partek Genomics Suite using annotations provided by the University of California Santa Cruz (UCSC) refflat.txt file. Transcripts expressed at low levels in all samples (<1 fragments per kilobase of exon model per million mapped reads [FPKM]) are filtered out. Differential expression analysis was performed by applying the ANOVA (ANOVA) method, and the resulting list was sorted and filtered into different transcript groups. Clustering of rod and cone enriched genes was performed using Cluster 3.0 software (see Methods). B: TopHat workflow: Splice junction mapping was performed using the TopHat algorithm in two phases. In the first phase, splice junctions were detected de novo from the reads from each sample and combined to obtain a master splice junctions list. In the second phase of TopHat alignment, reads from each sample were re-aligned by providing the master junctions list as input. The two-phase mapping approach significantly increased the number of alignments spanning the splice junctions. Estimation of gene expression and differential expression were computed using Cufflinks, Cuffcompare, and Cuffdiff. Sorting and filtering of transcript isoforms were performed as in the BWA workflow. upregulated in Nrl −/− (including cone-enriched genes and those involved in retinal remodeling) retinas.
TopHat workflow: A total of 34,115 transcripts were detected with a normalized FPKM value of greater than 1.0 in any of the samples in either group. Transcripts were filtered based on whether they were protein-coding mRNAs or ncRNAs. Of the 32,001 mRNA transcripts, only 3,258 met our criteria of differential expression ( Table 4). The DETs represented 1990 unique genes; 1,504 were downregulated in Nrl −/− (including rod-specific genes) retinas, and 1,754 were upregulated in the Nrl −/− (including cone-enriched genes and those involved in retinal remodeling) retinas.

Comparison of the results from BWA and TopHat analyses:
The BWA/ANOVA and TopHat/Cufflinks analyses were compared to assess the consistency and quality of the results. Using the official Mouse Genome Informatics gene symbol as the linking term, Venn diagrams were constructed to summarize the overlap between the set of all (Figure 2A), the top 500 ( Figure 2B), and the top 200 ( Figure 2C) DETs from the BWA workflow and the DET list from the TopHat workflow. A comparison of the full list of BWA DETs to the TopHat list revealed only 51.7% overlap between the differentially expressed genes (DEGs) from BWA and TopHat. This overlap increased to 73.8% and 87.8% when only the top 500 and 200 DEGs from BWA, respectively, were considered. Subsequent analyses were performed using BWA data.  The BWA workflow employed refflat.txt annotation for mouse build mm9 from UCSC genome browser. The TopHat workflow employed GTF annotation for mouse build mm9 from the Ensembl database. After FPKM filtering (see Materials and Methods), transcribed features were classified as protein coding mRNAs and non-coding (nc) RNAs. The features classified as proteincoding mRNAs were further filtered based on fold change (≥1.5) and p-value (<0.05) to be considered significantly differentially expressed transcripts (DETs).

A comparison of RNA-seq and qRT-PCR analysis for DEGs:
Based on the RNA-seq data from the WT and Nrl −/− retinas, we selected 25 DEGs (12 downregulated and 13 upregulated) showing a wide range of differential expression for validation with qRT-PCR analysis. qRT-PCR data for all genes validated the RNA-seq results ( Figure 4). The WNT1 inducible signaling pathway protein 1 (Wisp1) TaqMan assay did not produce an amplicon in any of the experiments performed; subsequent examination of the RNA-seq data revealed that this assay did not correspond to the splice variant expressed in the retina. Additional analysis using a SYBR assay with oligonucleotides specific to the retinal splice variant confirmed the differential expression of Wisp1 (−43.9 fold change) in the Nrl −/− retina compared to the WT.

Expression levels of transcripts in the WT and Nrl −/− retina:
The preceding analysis clearly demonstrates the high reliability and accuracy of the data obtained with RNA-seq methodology. We therefore used RNA-seq data to derive absolute expression levels of individual transcripts. The top 25 genes highly expressed in the WT or Nrl −/− retina are listed in Table 6 and Table 7. As predicted, most of these genes encode proteins involved in photoreceptor function/ metabolism. Rod and cone photoreceptor enriched genes: We then focused on DEGs between the Nrl −/− and WT retinas. A total of 1,422 transcripts, corresponding to 1,218 unique genes, showed a minimum fold change of 1.5 at p≤0.05. Hierarchical clustering of all differentially expressed transcripts resulted in two distinct clusters: one cluster of 477 genes downregulated in the Nrl −/− retina includes all known rod-specific genes such as rhodopsin (Rho; FC=-4,804), guanine nucleotide binding protein, alpha transducing 1 (Gnat1; FC=-2,034), and nuclear receptor subfamily 2, group E, member 3 (Nr2e3; FC=-227.5; Figure 5A and Table 8); and the other cluster of 741 upregulated genes had all cone-specific genes such as opsin 1, short-wave-sensitive (Opn1sw; FC=18.4), cyclic nucleotide gated channel beta 3 (Cngb3; FC=18.1), and Gnat2 (FC=12.2; Figure 5B and Table 9).
We then compared our DEG list with two published studies that examined WT and Nrl −/− retinas: a recent transcript-level RNA-seq analysis that included 6,123 DETs [54] and a gene-level microarray analysis showing 438 DEGs [38] (Figure 6). To obtain the list of DEGs from the Mustafi et al. [55] data set, we performed ANOVA on their FPKM data from GEO database. Interestingly, the DEGs lists from the three studies had only 203 common genes including many previously identified genes specifically expressed in cone (fatty acid binding protein 7, brain [Fabp7], Opn1sw, Cngb3, and Gnat2) or rod (Rho, Gnat1, cyclic nucleotide gated channel alpha 1 [Cnga1], and Nr2e3) photoreceptors. To assess the power of RNA-seq to more comprehensively identify DETs than microarray, we examined the list of 634   Our RNA-seq data allowed us to identify 359 genes not identified in previous investigations. To further assess the quality of our analysis, we performed qRT-PCR validation of 15 genes identified by other studies (but not in our study) as differentially expressed and of 7 genes uniquely identified by our study (but not by other studies; Table 10). Of the 15 genes identified by other studies, only three (ATP-binding cassette, sub-family A (ABC1), member 13 [Abca13], CD8 antigen, alpha chain [Cd8a], and acyl-CoA oxidase-like [Acoxl]) were confirmed with qRT-PCR as being differentially expressed. We also detected these three as differentially expressed but had filtered them out because of FPKM values that were less than 1.0 in all samples. Interestingly, the Abca13 transcript detected in the retina had only sequence reads for exons 56 through 62. This finding was supported by qRT-PCR using two SYBR assays designed to exons 53/55 and exons 56/58. All seven genes uniquely identified by our study were validated as significantly differentially expressed.
The significantly lower number of DETs detected by our study compared to the Mustafi et al. study (2011; 1,422 versus 6,123, respectively) can be attributed to the following: 1. We used a stringent 1.0 FPKM cutoff that generated a list of genes with significant base level expression and fewer false positives than a lower expression level threshold. If we had decreased our threshold to 0.1 FPKM, we would have detected 975 more DETs; however, these genes are expressed at an extremely low level and their impact must be weighed against the increase in false positives. We chose a   FC=3.9) and regulator of G-protein signaling 22 (Rgs22; FC=3.8) in the Nrl −/− retina. We also identified Crx opposite strand transcript 1 (Crxos1; FC=4.1), which is exclusively expressed in the eye from the opposite strand of a key retinal transcription factor, cone-rod homeobox containing gene (Crx) [55]. An interesting new finding is the retinal expression of multiple genes from the Kelch-like family (Klhl3, 4,5,18,33,36), solute carrier family (>30 members), and zinc-finger protein family (>10 members). Mutations in at least one gene from each family have previously been associated with retinal disease: Klhl7 with autosomal dominant RP [56], Slc24A1 with autosomal-recessive congenital stationary night blindness [57], and Znf513 with autosomal-recessive retinitis pigmentosa (RP) [58].

DISCUSSION
Specific patterns of gene expression define the morphology and function of distinct cell types and tissues. Changes in gene expression are associated with complex biologic processes, including development, aging, and disease pathogenesis. Until recently, such investigations focused on one or a few genes at a time. Advances in genomic technology have permitted simultaneous evaluation of most, if not all, genes that respond to an extrinsic microenvironment or intrinsic biologic program(s). Such studies are critical for delineating gene networks that can be targeted for treating specific diseases. RNA-seq allows comprehensive evaluation of transcriptomes, alternative transcripts, and coding polymorphisms. However, analyzing RNA-seq data has been challenging due to the complexity associated with quality control, sequence alignments, and handling of large data sets [59]. Several algorithms [45,60] have been proposed for mapping sequence reads to the reference genome, and multiple workflows [16,50] suggested for RNA-seq data analysis. Here, we report a detailed RNA-seq methodology using WT and Nrl −/− retinas as a study paradigm and establish the high performance of NGS technology compared to microarray and qRT-PCR platforms for transcript identification and quantification studies. Consistent with recent studies [61], our RNA-seq data demonstrate high sensitivity, a wider dynamic range of coverage, and lower technical variability.
Quantitative RT-PCR has long been considered the "gold standard" for mRNA quantification [62,63], and hence routinely used to validate results from transcriptome analysis studies. We show that FPKM values from RNA-seq analysis have a strong linear correlation across at least four orders of magnitude with Ct values from qRT-PCR. Expression of  As photoreceptors constitute almost 70% of cells in P21 WT and Nrl −/− retina, the high expressed genes (in bold) likely encode proteins associated with general photoreceptor function/metabolism. WT=wild type Figure 5. Heatmaps and hierarchical clusters of differentially expressed rod-specific genes and cone-specific genes or those involved in retinal remodeling. Heatmaps with dendrograms of clusters of differentially expressed rod genes (A) and cone / retinal remodeling genes (B) by applying hierarchical clustering. A filtered list of mRNA transcript isoforms was further revised for fold change ≥1.5 and p-value <0.05, and duplicate gene symbol rows were deleted to retain the most expressed isoform as reflective of the gene. This list was used to generate the heatmap and the master cluster. Specific clusters of rod specific genes and cone-specific or retinal remodeling genes were identified as clusters containing known rod genes (e.g., Rhodopsin  Many of the genes down-regulated in Nrl −/− retina represent rod-specific transcripts, whereas upregulated genes include those associated with cone function and retinal remodeling. Many of the genes down-regulated in Nrl −/− retina represent rod-specific transcripts, whereas upregulated genes include those associated with cone function and retinal remodeling. several HKGs is underestimated by RNA-seq because of the algorithmic limitation associated with alignment of reads that map to multiple genomic locations (paralogous sequences or pseudogenes). All of the outlying HKGs inspected had a lower-than-projected FPKM value due to varying numbers of associated pseudogenes [64][65][66][67]. For example, Gapdh has 331 pseudogenes in the mouse genome [64]. Our qRT-PCR data projected an FPKM value of approximately 4000 for Gapdh, yet the BWA workflow estimated an FPKM of only 6.6 in the WT retina (see Figure 3). This was also the case, but less severe, for Pgk1, Rps26, Rpl13a, and ActB. Current algorithms proportionally divide the number of reads aligning to multiple genes during FPKM calculation among those genes. In our study, unsuitable qRT-PCR assay design explains the remaining exceptions to the linear correlation between qRT-PCR and RNA-seq. After careful visual inspection of the aligned reads in IGV, we found that the assay designed for Wisp1 was not specific to the splice variant expressed in the retina. Similarly, the assays designed for Cadm3 and Ubc were specific to one of the two transcripts expressed in the retina. Hence, RNA-seq provides a better assessment of alternate isoforms, and transcript quantification is not limited by the design of qRT-PCR assays.
We took advantage of the RNA-seq data to inspect the expression of commonly used HKGs (see Table 5) for normalization in qRT-PCR assays. The choice generally depends on specific tissue and/or developmental time points being investigated. Our RNA-seq studies suggest that most of the HKGs can be used for normalization calculation in qRT-PCR assays; however, Gapdh, β-2 microglobulin (B2m), and Ubc do not appear to be good choices. Additional RNA-seq data would help in delineating relevant HKGs appropriate for qPCR validation in developing retina or cell types.
We compared two different strategies for analyzing WT and Nrl −/− RNA-seq data. The BWA workflow relies on fast and accurate gapped alignment of reads to the exonic regions of the genome. Since the gap between most adjacent exons is larger than a few bases, the cumulative gap extension penalty underestimates the quality of the alignment of reads spanning the splice junctions. Hence, the BWA workflow produced accurate quantitative estimation of gene and transcript isoform expression while losing valuable information about Figure 6. Comparison of the current and previous data sets of differentially expressed transcripts of Nrl −/− versus wild type (WT) mouse retina. Overlap between the differentially expressed transcripts (DETs) identified in the current study and previous studies using mouse retinas from the same age and genotype was determined using the Mouse Genome Informatics (MGI) gene symbol as the identifier. The current study includes all mRNA transcripts identified with the Burrows-Wheeler Aligner (BWA) workflow (fold change ≥1.5 and p value <0.05). The 438 DETs of an Affymetrix microarray study [38] and 6,123 DETs from another RNA-seq study [54]   Differential expression of genes identified in three global profiling studies (Corbo, Mustafi and this report) was validated by qRT-PCR. Genes shown under the Corbo and Mustafi subheadings were identified in respective studies, but not in the current study. Corbo-Mustafi subheading indicates genes identified in both Corbo and Mustafi studies but not in the current report. The two genes, Cd8a and Acoxl, identified previously were differentially expressed by qRT-PCR but were filtered out of our sequencing data because of the low FPKM values observed. The gene Abca13 was detected as only expressing the last 7 exons of the annotated sequence. qRT-PCR assays that distinguished between the two isoforms were able to confirm the differential expression of the last 7 Abca13 exons. This gene was filtered out of our sequencing results because of the low FPKM value observed. All genes included in this report subheading were uniquely identified by the current study (and not in Corbo and Mustafi reports). FC, p-val, and FPKM values in this report are derived from the sequencing results reported here. Genes were selected as having the largest fold changes in each category and from current annotation in the latest mm9 Refseq build. All differentially expressed genes identified in this report could be validated by qRT-PCR. FC=fold change, NA=not applicable, FPKM=fragments per kilobase exon model per million reads. alternate splicing. The higher accuracy of quantitative gene expression estimates by the BWA workflow compared to those by TopHat is evident from the stronger correlation determined by linear regression analysis of the DETs. The regression line from BWA had a slope of −1.056 (compared to −0.905 for TopHat) and R 2 of 0.8798 (compared to 0.7727 for TopHat).
The TopHat workflow maps the reads to exonic regions of the reference genome as well as across all known and putative splice junctions defined in the Ensembl GTF file. TopHat attempts to map reads across splice junctions defined in the Ensembl GTF file and across novel splice junctions detected during the first phase of alignment. Hence, the TopHat workflow maps significantly more reads starting with the same number of pass filter (PF) reads and detects additional transcript isoforms missed by the BWA workflow. The source of genomic annotations used by these methods is another important difference. UCSC refFlat annotation (used by the BWA workflow) for the mouse reference genome (build mm9) contained approximately 28,000 unique transcript isoforms, whereas the Ensembl GTF file (used by the TopHat workflow) for the same genome build listed three times more unique transcript isoforms. The problem is amplified because of the lack of one-to-one mapping for several transcripts defined in the UCSC refFlat file in Ensembl GTF. Hence, a non-trivial number of DETs detected by the BWA workflow could not be mapped to any DET from the TopHat workflow (see Figure 2, regions shaded in green).
The BWA workflow detects about 16,000 transcripts in the retina, with a minimum expression equivalent to one transcript per cell (i.e., 1 FPKM) [16]. When the criteria were relaxed to cover transcripts expressed at low levels (0.1 FPKM), 20,707 transcripts were detected in the retina. This is not surprising as the whole retina includes more than 50 distinct neuronal cell types, and each cell would achieve protein diversity largely by alternative promoter usage and/or alternative splicing [68]. The TopHat workflow yields thousands of known and putative transcript isoforms not previously described in the retina. However, validating these novel isoforms predicted from RNA-seq data remains a challenge.
Integrated analysis of RNA-seq data with miRNA-seq, transcription factor binding sites data (chromatin immunoprecipitation sequencing-Chip-Seq), genetic variations (expression Quantitative Trait Loci) [69], and methylation patterns would allow decoding of the complex regulatory networks associated with retinal development and function. Several technical improvements would however be necessary to overcome the bias introduced into the RNA-seq data due to GC content, mappability of reads, length of the gene, and regional differences due to local sequence structure [70]. RNA-seq methods are more likely to identify longer differentially expressed transcripts than shorter transcripts with the same effect size [71]. New statistical methods are being developed to correct for systematic biases inherent in NGS data [70][71][72]. In the coming years, we will witness an explosion in high throughput genomic methods that are expected to revolutionize biology and biomedical discovery.